Email IDs:
Raj Chavan : rchavan@iu.edu
Sanket Bailmare: sbailmar@iu.edu
Shefali Luley: sluley@iu.edu
Tanay Kulkarni : tankulk@iu.edu
Group : 23
Members: Raj Chavan, Sanket Bailmare, Shefali Luley, Tanay Kulkarni
In today's world, many people struggle to obtain loans due to insufficient or non-existent credit histories, which often pushes them toward untrustworthy lenders who take advantage of this population. Home Credit works to expand financial inclusion for the unbanked by providing a safe borrowing experience. To predict repayment ability, Home Credit uses a variety of alternative data and methods, such as clients' background information and their transactional records. To ensure that this underserved demographic has a favorable loan experience, we will use machine learning and statistical methods to make these predictions. This will help Home Credit ensure that clients capable of repayment are granted a loan rather than rejected, and that they receive a loan maturity plan and repayment calendar that set them up for success.

In the previous stage, we studied the problem and the data, after which we devised our plan of action. Our goal in this phase is to clean and preprocess the Home Credit Default Risk (HCDR) data and to build the baseline pipeline. We begin by understanding the data: what kind of data we are presented with, and how many rows and columns it contains and in what form. We do this by using various functions to get descriptive statistics for each column. After understanding the data, we clean it by removing columns with more than 50% NaN (null) values. Data cleaning is an important step because it helps us focus on the information relevant to what we want to achieve. After cleaning the data, we apply preprocessing steps that include, but are not limited to, computing correlations, dropping columns, computing means and medians, and splitting the features into numerical and categorical groups.
We create a baseline pipeline for the numerical features and another for the categorical features, then merge them into a single pipeline. Finally, we use this data to train our model. We also perform visual EDA to draw further useful insights from the data.
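The merging of the numerical and categorical pipelines described above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a minimal illustration, not our full pipeline; the tiny DataFrame and its values are made up for demonstration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the application data (values are illustrative).
X = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, np.nan, 67500.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", np.nan],
})

num_cols = ["AMT_INCOME_TOTAL"]
cat_cols = ["NAME_CONTRACT_TYPE"]

# Numerical branch: replace missing values with the column mean, then scale.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical branch: replace missing values with the mode, then one-hot encode.
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer merges both branches into one preprocessing step.
preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

The fitted `preprocess` object can then be placed at the front of a modeling pipeline so the same imputation and encoding are applied consistently to the train and test sets.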
The Home Credit Default Risk dataset is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. It comprises the following files:

- application.csv: The main dataset, divided into train and test sets. It contains information about each loan and loan applicant at application time.
- bureau.csv: Clients' previous loans from other institutions that were reported to the Credit Bureau, with one row per loan.
- bureau_balance.csv: Monthly balances of the earlier credits in the Credit Bureau.
- previous_application.csv: The applicant's previous loans at Home Credit, including the previous loan parameters and the client's information at that time.
- POS_CASH_balance.csv: Monthly balance snapshots of the previous point-of-sale (POS) and cash loans the applicant had with Home Credit.
- installments_payments.csv: Clients' payment history for each installment of the earlier Home Credit loans related to the loans in our sample.
- credit_card_balance.csv: Monthly balance snapshots of clients' previous credit cards with Home Credit.
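Since files such as bureau.csv hold multiple rows per client, they must be aggregated to one row per `SK_ID_CURR` before they can be joined to the application table. A minimal sketch with made-up mini-frames (the aggregate column names `prev_loan_count` and `prev_credit_sum` are our own, not part of the dataset):

```python
import pandas as pd

# Illustrative mini-frames; the real data comes from the Kaggle files above.
application = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],
    "AMT_CREDIT": [406597.5, 1293502.5],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "AMT_CREDIT_SUM": [91323.0, 225000.0, 810000.0],
})

# bureau has one row per prior loan, so collapse it to one row per client.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
          .agg(prev_loan_count=("AMT_CREDIT_SUM", "size"),
               prev_credit_sum=("AMT_CREDIT_SUM", "sum"))
          .reset_index()
)

# Left join keeps every applicant even if they have no Credit Bureau history.
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

The same groupby-then-merge pattern applies to the other per-loan and per-month files (previous_application.csv, POS_CASH_balance.csv, etc.).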
Dataset link: https://www.kaggle.com/c/home-credit-default-risk/data
The task to be tackled:
Diagram:
This block diagram illustrates the workflow of the data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
df = pd.read_csv('drive/MyDrive/application_train.csv')
df_test = pd.read_csv('drive/MyDrive/application_test.csv')
Data description:
The df.info() function provides a description of the dataset, such as the number of rows and columns and the data type of each column.
The df_test.describe() function gives summary statistics for each numerical column, such as the count, mean, minimum, and maximum.
Feature Engineering:
The first step in dealing with the data was to remove columns that are redundant and would not contribute to prediction. We explored the data and counted the missing values, removing columns with more than 50% missing values. We also checked the distribution of zeros in each column and removed columns in which more than 90% of the rows contain only 0. We then split the features into numerical and categorical groups. The numerical features are handled by an intermediate imputer pipeline that replaces missing values with the mean of the column; the categorical features are one-hot encoded (OHE) after their missing values are replaced with the mode of the column.
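The two column-dropping rules above can be expressed compactly with pandas. A minimal sketch on a toy frame (the column names are made up; the real thresholds of 50% NaN and 90% zeros are the ones described above):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for application_train.csv (column names are made up).
sample_df = pd.DataFrame({
    "mostly_nan": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "mostly_zero": [0.0, 0.0, 0.0, 0.0, 0.0],
    "keep_me": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# Drop columns where more than 50% of the values are missing.
nan_frac = sample_df.isnull().mean()
sample_df = sample_df.drop(columns=nan_frac[nan_frac > 0.5].index)

# Drop columns where more than 90% of the rows are exactly 0.
zero_frac = (sample_df == 0).mean()
sample_df = sample_df.drop(columns=zero_frac[zero_frac > 0.9].index)

print(list(sample_df.columns))  # ['keep_me']
```

`isnull().mean()` gives the fraction of missing values per column, so the two rules reduce to a single boolean filter each.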
df.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ... | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
df.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | ... | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 104582.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307509.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 134133.000000 | 3.068510e+05 | 246546.000000 | 151450.00000 | 127568.000000 | 157504.000000 | 103023.000000 | 92646.000000 | 143620.000000 | 152683.000000 | 154491.000000 | ... | 152683.000000 | 154491.000000 | 98869.000000 | 124921.000000 | 97312.000000 | 153161.000000 | 93997.000000 | 137829.000000 | 159080.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 306490.000000 | 307510.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.00000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | -4986.120328 | -2994.202373 | 12.061091 | 0.999997 | 0.819889 | 0.199368 | 0.998133 | 0.281066 | 0.056720 | 2.152665 | 2.052463 | 2.031521 | 12.063419 | 0.015144 | 0.050769 | 0.040659 | 0.078173 | 0.230454 | 0.179555 | 0.502130 | 5.143927e-01 | 0.510853 | 0.11744 | 0.088442 | 0.977735 | 0.752471 | 0.044621 | 0.078942 | 0.149725 | 0.226282 | ... | 0.149213 | 0.225897 | 0.231625 | 0.067169 | 0.101954 | 0.108607 | 0.008651 | 0.028236 | 0.102547 | 1.422245 | 0.143421 | 1.405292 | 0.100049 | -962.858788 | 0.000042 | 0.710023 | 0.000081 | 0.015115 | 0.088055 | 0.000192 | 0.081376 | 0.003896 | 0.000023 | 0.003912 | 0.000007 | 0.003525 | 0.002936 | 0.00121 | 0.009928 | 0.000267 | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | 3522.886321 | 1509.450419 | 11.944812 | 0.001803 | 0.384280 | 0.399526 | 0.043164 | 0.449521 | 0.231307 | 0.910682 | 0.509034 | 0.502737 | 3.265832 | 0.122126 | 0.219526 | 0.197499 | 0.268444 | 0.421124 | 0.383817 | 0.211062 | 1.910602e-01 | 0.194844 | 0.10824 | 0.082438 | 0.059223 | 0.113280 | 0.076036 | 0.134576 | 0.100049 | 0.144641 | ... | 0.100368 | 0.145067 | 0.161934 | 0.082167 | 0.093642 | 0.112260 | 0.047415 | 0.070166 | 0.107462 | 2.400989 | 0.446698 | 2.379803 | 0.362291 | 826.808487 | 0.006502 | 0.453752 | 0.009016 | 0.122010 | 0.283376 | 0.013850 | 0.273412 | 0.062295 | 0.004771 | 0.062424 | 0.002550 | 0.059268 | 0.054110 | 0.03476 | 0.099144 | 0.016327 | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | -24672.000000 | -7197.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014568 | 8.173617e-08 | 0.000527 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | -7479.500000 | -4299.000000 | 5.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.334007 | 3.924574e-01 | 0.370650 | 0.05770 | 0.044200 | 0.976700 | 0.687200 | 0.007800 | 0.000000 | 0.069000 | 0.166700 | ... | 0.069000 | 0.166700 | 0.083300 | 0.018700 | 0.051300 | 0.045700 | 0.000000 | 0.000000 | 0.041200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | -4504.000000 | -3254.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.505998 | 5.659614e-01 | 0.535276 | 0.08760 | 0.076300 | 0.981600 | 0.755200 | 0.021100 | 0.000000 | 0.137900 | 0.166700 | ... | 0.137900 | 0.166700 | 0.208300 | 0.048700 | 0.076100 | 0.074900 | 0.000000 | 0.003100 | 0.068800 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -757.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | -2010.000000 | -1720.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.675053 | 6.636171e-01 | 0.669057 | 0.14850 | 0.112200 | 0.986600 | 0.823200 | 0.051500 | 0.120000 | 0.206900 | 0.333300 | ... | 0.206900 | 0.333300 | 0.375000 | 0.086800 | 0.123100 | 0.130300 | 0.003900 | 0.026600 | 0.127600 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -274.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | 0.000000 | 0.000000 | 91.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 20.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.962693 | 8.549997e-01 | 0.896010 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 348.000000 | 34.000000 | 344.000000 | 24.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
df_test.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | ... | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | Unaccompanied | Working | Higher education | Married | House / apartment | 0.018850 | -19241 | -2329 | -5170.0 | -812 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | NaN | 2.0 | 2 | 2 | TUESDAY | 18 | 0 | 0 | 0 | 0 | 0 | 0 | Kindergarten | ... | NaN | 0.0514 | NaN | NaN | NaN | block of flats | 0.0392 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -1740.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.035792 | -18064 | -4469 | -9118.0 | -1623 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Low-skill Laborers | 2.0 | 2 | 2 | FRIDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Self-employed | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | NaN | Working | Higher education | Married | House / apartment | 0.019101 | -20038 | -4458 | -2175.0 | -3503 | 5.0 | 1 | 1 | 0 | 1 | 0 | 0 | Drivers | 2.0 | 2 | 2 | MONDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Transport: type 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -856.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.026392 | -13976 | -1866 | -2000.0 | -4208 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Sales staff | 4.0 | 2 | 2 | WEDNESDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | ... | 0.2446 | 0.3739 | 0.0388 | 0.0817 | reg oper account | block of flats | 0.3700 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -1805.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | 0.010032 | -13040 | -2191 | -4000.0 | -4262 | 16.0 | 1 | 1 | 1 | 1 | 0 | 0 | NaN | 3.0 | 2 | 2 | FRIDAY | 5 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -821.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
df_test.describe()
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | ... | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 16432.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 28212.000000 | 48736.000000 | 40076.000000 | 24857.000000 | 21103.000000 | 25888.000000 | 16926.000000 | 15249.000000 | 23555.000000 | 25165.000000 | 25423.000000 | 16278.000000 | ... | 25165.000000 | 25423.000000 | 16278.000000 | 20490.000000 | 15964.000000 | 25192.000000 | 15397.000000 | 22660.000000 | 26120.000000 | 48715.000000 | 48715.000000 | 48715.000000 | 48715.000000 | 48744.000000 | 48744.0 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.0 | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | -3051.712949 | 11.786027 | 0.999979 | 0.809720 | 0.204702 | 0.998400 | 0.263130 | 0.162646 | 2.146767 | 2.038159 | 2.012596 | 12.007365 | 0.018833 | 0.055166 | 0.042036 | 0.077466 | 0.224664 | 0.174216 | 0.501180 | 0.518021 | 0.500106 | 0.122388 | 0.090065 | 0.978828 | 0.751137 | 0.047624 | 0.085168 | 0.151777 | 0.233706 | 0.238423 | ... | 0.151200 | 0.233154 | 0.237846 | 0.068069 | 0.107063 | 0.113368 | 0.008979 | 0.029296 | 0.107129 | 1.447644 | 0.143652 | 1.435738 | 0.101139 | -1077.766228 | 0.0 | 0.786620 | 0.000103 | 0.014751 | 0.087477 | 0.000041 | 0.088462 | 0.004493 | 0.0 | 0.001169 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | 1569.276709 | 11.462889 | 0.004529 | 0.392526 | 0.403488 | 0.039971 | 0.440337 | 0.369046 | 0.890423 | 0.522694 | 0.515804 | 3.278172 | 0.135937 | 0.228306 | 0.200673 | 0.267332 | 0.417365 | 0.379299 | 0.205142 | 0.181278 | 0.189498 | 0.113112 | 0.081536 | 0.049318 | 0.113188 | 0.082868 | 0.139164 | 0.100669 | 0.147361 | 0.164976 | ... | 0.100931 | 0.147629 | 0.165241 | 0.082869 | 0.099737 | 0.116503 | 0.048148 | 0.072998 | 0.111420 | 3.608053 | 0.514413 | 3.580125 | 0.403791 | 878.920740 | 0.0 | 0.409698 | 0.010128 | 0.120554 | 0.282536 | 0.006405 | 0.283969 | 0.066879 | 0.0 | 0.034176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | -6348.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013458 | 0.000008 | 0.000527 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4361.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | -4448.000000 | 4.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.343695 | 0.408066 | 0.363945 | 0.061900 | 0.046700 | 0.976700 | 0.687200 | 0.008100 | 0.000000 | 0.074500 | 0.166700 | 0.104200 | ... | 0.069000 | 0.166700 | 0.083300 | 0.019000 | 0.051300 | 0.049000 | 0.000000 | 0.000000 | 0.043200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1766.250000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | -3234.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 2.000000 | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.506771 | 0.558758 | 0.519097 | 0.092800 | 0.078100 | 0.981600 | 0.755200 | 0.022700 | 0.000000 | 0.137900 | 0.166700 | 0.208300 | ... | 0.137900 | 0.166700 | 0.208300 | 0.048800 | 0.077000 | 0.077600 | 0.000000 | 0.003100 | 0.070700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -863.000000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | -1706.000000 | 15.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.665956 | 0.658497 | 0.652897 | 0.148500 | 0.113400 | 0.986600 | 0.816400 | 0.053900 | 0.160000 | 0.206900 | 0.333300 | 0.375000 | ... | 0.206900 | 0.333300 | 0.375000 | 0.088000 | 0.126600 | 0.137425 | 0.003900 | 0.028025 | 0.135700 | 2.000000 | 0.000000 | 2.000000 | 0.000000 | -363.000000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | 0.000000 | 74.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 21.000000 | 3.000000 | 3.000000 | 23.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.939145 | 0.855000 | 0.882530 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 354.000000 | 34.000000 | 351.000000 | 24.000000 | 0.000000 | 0.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB
Declaring some functions
def missing_per(df):
    # Percentage of missing values per column, highest first
    missing_vals = df.isnull().sum(axis=0) * 100 / len(df)
    return missing_vals.sort_values(ascending=False)
def remove_cols(df, rem_cols):
    df.drop(columns=rem_cols, inplace=True)
    return df
# p = missing_per(df)
# cols_to_remove = p.reset_index()
# cols_to_remove.columns = ['col_name','flag']
# cols_to_remove_ = cols_to_remove[cols_to_remove['flag']>50]
# rem_cols = list(cols_to_remove_['col_name'])
def get_unique_in_a_column(df, n):
    u = []
    for cols in df.columns:
        if cols == 'TARGET':
            continue
        u.append([cols, df[cols].nunique()])
    df_temp = pd.DataFrame(u, columns=['col_name', 'unique_values'])
    return pd.DataFrame(df_temp[df_temp['unique_values'] < n])
def seg_num_cat(df_phase3):
    dfp3_numerical = df_phase3.select_dtypes(exclude='object')
    dfp3_numerical['TARGET'] = df_phase3['TARGET']
    dfp3_categorical = df_phase3.select_dtypes(include='object')
    return dfp3_numerical, dfp3_categorical
# def mean_mode(df_phase3):
#     cols_req_work = []
#     mean_cols = ['YEARS_BEGINEXPLUATATION_AVG','YEARS_BEGINEXPLUATATION_MEDI','YEARS_BEGINEXPLUATATION_MODE','TOTALAREA_MODE','EXT_SOURCE_3','EXT_SOURCE_2', 'AMT_GOODS_PRICE', 'AMT_ANNUITY','CNT_FAM_MEMBERS','DAYS_LAST_PHONE_CHANGE']
#     mode_cols = ['NAME_TYPE_SUITE','DEF_60_CNT_SOCIAL_CIRCLE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE','CNT_FAM_MEMBERS']
#     print(df_phase3['NAME_TYPE_SUITE'].mode()[0])
#     for i in mean_cols:
#         column_means = df_phase3[i].mean()
#         df_phase3[i] = df_phase3[i].fillna(column_means)
#     for i in mode_cols:
#         col_mode = df_phase3[i].mode()[0]
#         df_phase3[i] = df_phase3[i].fillna(col_mode)
#     return df_phase3
def zeros_df(df, per):
    dfp3_n0s = pd.DataFrame()
    columns = []
    percentage = []
    for col in df.columns:
        if col == 'TARGET':
            continue
        count = (df[col] == 0).sum()
        columns.append(col)
        percentage.append(count / len(df[col]))
    dfp3_n0s['Column'] = columns
    dfp3_n0s['Percentage'] = percentage
    per = per / 100
    dfp3_n0s = dfp3_n0s[dfp3_n0s['Percentage'] > per]
    return dfp3_n0s
def to_consider_as_cat(df, thres_val):
    return list(get_unique_in_a_column(df, thres_val)['col_name'])
def corr_target(df, cor):
    cor_matrix = df.corr()['TARGET'].sort_values(key=abs, ascending=False).reset_index()
    cor_matrix.columns = ['col_name', 'Correlation']
    column_after_corr_filter = cor_matrix[abs(cor_matrix['Correlation']) > cor]
    return column_after_corr_filter
def missing(df, n):
    new_df = missing_per(df).reset_index()
    new_df.columns = ['index', 'flag']
    fin_df = []
    for row in new_df.itertuples():
        try:
            # Numerical columns: record median, mean and unique-value count
            fin_df.append([row.index, row.flag, df[row.index].median(), df[row.index].mean(), df[row.index].nunique()])
        except Exception:
            # Categorical columns have no mean/median; fall back to the mode
            fin_df.append([row.index, row.flag, df[row.index].mode(), 'NA', df[row.index].nunique()])
    cols = ['col_name', 'percentage_missing', 'median/Mode', 'mean', 'no_of_unique_values']
    temp = pd.DataFrame(fin_df, columns=cols)
    return temp[temp['percentage_missing'] > n]
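To illustrate what these helpers return before applying them to the HCDR data, here is a minimal, self-contained sketch of `missing_per` on a toy DataFrame (the toy column names are hypothetical, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' is 50% missing, 'b' is 25% missing, 'c' is complete.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, 2.0, 2.0, 4.0],
    'c': ['x', 'y', 'x', 'x'],
})

def missing_per(df):
    # Percentage of missing values per column, highest first.
    missing_vals = df.isnull().sum(axis=0) * 100 / len(df)
    return missing_vals.sort_values(ascending=False)

print(missing_per(toy))
# 'a' comes first with 50.0, then 'b' with 25.0, then 'c' with 0.0
```

The same sorted-percentage Series is what `missing` builds on when it attaches medians, means, and unique-value counts.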
Columns with more than 90% zero values in them
more_than_90_zero = zeros_df(df,90)
more_than_90_zero
| | Column | Percentage |
|---|---|---|
| 26 | FLAG_EMAIL | 0.943280 |
| 33 | REG_REGION_NOT_LIVE_REGION | 0.984856 |
| 34 | REG_REGION_NOT_WORK_REGION | 0.949231 |
| 35 | LIVE_REGION_NOT_WORK_REGION | 0.959341 |
| 36 | REG_CITY_NOT_LIVE_CITY | 0.921827 |
| 93 | DEF_60_CNT_SOCIAL_CIRCLE | 0.912881 |
| 95 | FLAG_DOCUMENT_2 | 0.999958 |
| 97 | FLAG_DOCUMENT_4 | 0.999919 |
| 98 | FLAG_DOCUMENT_5 | 0.984885 |
| 99 | FLAG_DOCUMENT_6 | 0.911945 |
| 100 | FLAG_DOCUMENT_7 | 0.999808 |
| 101 | FLAG_DOCUMENT_8 | 0.918624 |
| 102 | FLAG_DOCUMENT_9 | 0.996104 |
| 103 | FLAG_DOCUMENT_10 | 0.999977 |
| 104 | FLAG_DOCUMENT_11 | 0.996088 |
| 105 | FLAG_DOCUMENT_12 | 0.999993 |
| 106 | FLAG_DOCUMENT_13 | 0.996475 |
| 107 | FLAG_DOCUMENT_14 | 0.997064 |
| 108 | FLAG_DOCUMENT_15 | 0.998790 |
| 109 | FLAG_DOCUMENT_16 | 0.990072 |
| 110 | FLAG_DOCUMENT_17 | 0.999733 |
| 111 | FLAG_DOCUMENT_18 | 0.991870 |
| 112 | FLAG_DOCUMENT_19 | 0.999405 |
| 113 | FLAG_DOCUMENT_20 | 0.999493 |
| 114 | FLAG_DOCUMENT_21 | 0.999665 |
Dropping the above columns from the dataset
df.drop(columns = more_than_90_zero['Column'],inplace = True)
Columns in the training set with more than 30% missing data, along with their median/mode and number of unique values
missing(df,30)
| | col_name | percentage_missing | median/Mode | mean | no_of_unique_values |
|---|---|---|---|---|---|
| 0 | COMMONAREA_MEDI | 69.872297 | 0.0208 | 0.0445951 | 3202 |
| 1 | COMMONAREA_MODE | 69.872297 | 0.019 | 0.0425531 | 3128 |
| 2 | COMMONAREA_AVG | 69.872297 | 0.0211 | 0.0446207 | 3181 |
| 3 | NONLIVINGAPARTMENTS_MEDI | 69.432963 | 0 | 0.00865101 | 214 |
| 4 | NONLIVINGAPARTMENTS_AVG | 69.432963 | 0 | 0.00880867 | 386 |
| 5 | NONLIVINGAPARTMENTS_MODE | 69.432963 | 0 | 0.00807639 | 167 |
| 6 | FONDKAPREMONT_MODE | 68.386172 | 0 reg oper account dtype: object | NA | 4 |
| 7 | LIVINGAPARTMENTS_AVG | 68.354953 | 0.0756 | 0.100775 | 1868 |
| 8 | LIVINGAPARTMENTS_MEDI | 68.354953 | 0.0761 | 0.101954 | 1097 |
| 9 | LIVINGAPARTMENTS_MODE | 68.354953 | 0.0771 | 0.105645 | 736 |
| 10 | FLOORSMIN_AVG | 67.848630 | 0.2083 | 0.231894 | 305 |
| 11 | FLOORSMIN_MEDI | 67.848630 | 0.2083 | 0.231625 | 47 |
| 12 | FLOORSMIN_MODE | 67.848630 | 0.2083 | 0.228058 | 25 |
| 13 | YEARS_BUILD_MEDI | 66.497784 | 0.7585 | 0.755746 | 151 |
| 14 | YEARS_BUILD_AVG | 66.497784 | 0.7552 | 0.752471 | 149 |
| 15 | YEARS_BUILD_MODE | 66.497784 | 0.7648 | 0.759637 | 154 |
| 16 | OWN_CAR_AGE | 65.990810 | 9 | 12.0611 | 62 |
| 17 | LANDAREA_MODE | 59.376738 | 0.0458 | 0.0649577 | 3563 |
| 18 | LANDAREA_MEDI | 59.376738 | 0.0487 | 0.0671687 | 3560 |
| 19 | LANDAREA_AVG | 59.376738 | 0.0481 | 0.0663332 | 3527 |
| 20 | BASEMENTAREA_AVG | 58.515956 | 0.0763 | 0.0884422 | 3780 |
| 21 | BASEMENTAREA_MEDI | 58.515956 | 0.0758 | 0.0879549 | 3772 |
| 22 | BASEMENTAREA_MODE | 58.515956 | 0.0746 | 0.0875432 | 3841 |
| 23 | EXT_SOURCE_1 | 56.381073 | 0.505998 | 0.50213 | 114584 |
| 24 | NONLIVINGAREA_MODE | 55.179164 | 0.0011 | 0.0270223 | 3327 |
| 25 | NONLIVINGAREA_MEDI | 55.179164 | 0.0031 | 0.0282359 | 3323 |
| 26 | NONLIVINGAREA_AVG | 55.179164 | 0.0036 | 0.0283578 | 3290 |
| 27 | ELEVATORS_AVG | 53.295980 | 0 | 0.0789415 | 257 |
| 28 | ELEVATORS_MEDI | 53.295980 | 0 | 0.0780778 | 46 |
| 29 | ELEVATORS_MODE | 53.295980 | 0 | 0.0744897 | 26 |
| 30 | WALLSMATERIAL_MODE | 50.840783 | 0 Panel dtype: object | NA | 7 |
| 31 | APARTMENTS_MODE | 50.749729 | 0.084 | 0.114231 | 760 |
| 32 | APARTMENTS_AVG | 50.749729 | 0.0876 | 0.11744 | 2339 |
| 33 | APARTMENTS_MEDI | 50.749729 | 0.0864 | 0.11785 | 1148 |
| 34 | ENTRANCES_MODE | 50.348768 | 0.1379 | 0.145193 | 30 |
| 35 | ENTRANCES_AVG | 50.348768 | 0.1379 | 0.149725 | 285 |
| 36 | ENTRANCES_MEDI | 50.348768 | 0.1379 | 0.149213 | 46 |
| 37 | LIVINGAREA_MODE | 50.193326 | 0.0731 | 0.105975 | 5301 |
| 38 | LIVINGAREA_AVG | 50.193326 | 0.0745 | 0.107399 | 5199 |
| 39 | LIVINGAREA_MEDI | 50.193326 | 0.0749 | 0.108607 | 5281 |
| 40 | HOUSETYPE_MODE | 50.176091 | 0 block of flats dtype: object | NA | 3 |
| 41 | FLOORSMAX_AVG | 49.760822 | 0.1667 | 0.226282 | 403 |
| 42 | FLOORSMAX_MODE | 49.760822 | 0.1667 | 0.222315 | 25 |
| 43 | FLOORSMAX_MEDI | 49.760822 | 0.1667 | 0.225897 | 49 |
| 44 | YEARS_BEGINEXPLUATATION_MODE | 48.781019 | 0.9816 | 0.977065 | 221 |
| 45 | YEARS_BEGINEXPLUATATION_MEDI | 48.781019 | 0.9816 | 0.977752 | 245 |
| 46 | YEARS_BEGINEXPLUATATION_AVG | 48.781019 | 0.9816 | 0.977735 | 285 |
| 47 | TOTALAREA_MODE | 48.268517 | 0.0688 | 0.102547 | 5116 |
| 48 | EMERGENCYSTATE_MODE | 47.398304 | 0 No dtype: object | NA | 2 |
| 49 | OCCUPATION_TYPE | 31.345545 | 0 Laborers dtype: object | NA | 18 |
Segregating the dataset into numerical and categorical dataframes
df_num, df_cat = seg_num_cat(df)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy This is separate from the ipykernel package so we can avoid doing imports until
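The SettingWithCopyWarning above is raised because a new column is assigned to the frame returned by `select_dtypes`, which may be a view of the original. A hedged fix (a sketch, not the notebook's exact code) is to take an explicit `.copy()` first:

```python
import pandas as pd

def seg_num_cat(df):
    # .copy() yields an independent DataFrame, so the column assignment
    # below cannot trigger SettingWithCopyWarning.
    df_numerical = df.select_dtypes(exclude='object').copy()
    df_numerical['TARGET'] = df['TARGET']
    df_categorical = df.select_dtypes(include='object').copy()
    return df_numerical, df_categorical

# Tiny demonstration frame (hypothetical columns, not the HCDR schema):
demo = pd.DataFrame({'TARGET': [0, 1], 'amt': [1.5, 2.5], 'kind': ['a', 'b']})
num, cat = seg_num_cat(demo)
print(list(num.columns), list(cat.columns))
```

The result is unchanged; the copy only makes the ownership of the new frame explicit.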
df_num.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 307511.0 | 278180.518577 | 102790.175348 | 100002.0 | 189145.5 | 278202.0 | 367142.5 | 456255.0 |
| TARGET | 307511.0 | 0.080729 | 0.272419 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CNT_CHILDREN | 307511.0 | 0.417052 | 0.722121 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 |
| AMT_INCOME_TOTAL | 307511.0 | 168797.919297 | 237123.146279 | 25650.0 | 112500.0 | 147150.0 | 202500.0 | 117000000.0 |
| AMT_CREDIT | 307511.0 | 599025.999706 | 402490.776996 | 45000.0 | 270000.0 | 513531.0 | 808650.0 | 4050000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_DAY | 265992.0 | 0.007000 | 0.110757 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 265992.0 | 0.034362 | 0.204685 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| AMT_REQ_CREDIT_BUREAU_MON | 265992.0 | 0.267395 | 0.916002 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 |
| AMT_REQ_CREDIT_BUREAU_QRT | 265992.0 | 0.265474 | 0.794056 | 0.0 | 0.0 | 0.0 | 0.0 | 261.0 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 265992.0 | 1.899974 | 1.869295 | 0.0 | 0.0 | 1.0 | 3.0 | 25.0 |
81 rows × 8 columns
df_cat.describe().T
| | count | unique | top | freq |
|---|---|---|---|---|
| NAME_CONTRACT_TYPE | 307511 | 2 | Cash loans | 278232 |
| CODE_GENDER | 307511 | 3 | F | 202448 |
| FLAG_OWN_CAR | 307511 | 2 | N | 202924 |
| FLAG_OWN_REALTY | 307511 | 2 | Y | 213312 |
| NAME_TYPE_SUITE | 306219 | 7 | Unaccompanied | 248526 |
| NAME_INCOME_TYPE | 307511 | 8 | Working | 158774 |
| NAME_EDUCATION_TYPE | 307511 | 5 | Secondary / secondary special | 218391 |
| NAME_FAMILY_STATUS | 307511 | 6 | Married | 196432 |
| NAME_HOUSING_TYPE | 307511 | 6 | House / apartment | 272868 |
| OCCUPATION_TYPE | 211120 | 18 | Laborers | 55186 |
| WEEKDAY_APPR_PROCESS_START | 307511 | 7 | TUESDAY | 53901 |
| ORGANIZATION_TYPE | 307511 | 58 | Business Entity Type 3 | 67992 |
| FONDKAPREMONT_MODE | 97216 | 4 | reg oper account | 73830 |
| HOUSETYPE_MODE | 153214 | 3 | block of flats | 150503 |
| WALLSMATERIAL_MODE | 151170 | 7 | Panel | 66040 |
| EMERGENCYSTATE_MODE | 161756 | 2 | No | 159428 |
Numerical columns and their correlation with the TARGET column, in descending order of absolute value
corr_target(df_num,0.00)
| | col_name | Correlation |
|---|---|---|
| 0 | TARGET | 1.000000 |
| 1 | EXT_SOURCE_3 | -0.178919 |
| 2 | EXT_SOURCE_2 | -0.160472 |
| 3 | EXT_SOURCE_1 | -0.155317 |
| 4 | DAYS_BIRTH | 0.078239 |
| ... | ... | ... |
| 76 | NONLIVINGAPARTMENTS_MODE | -0.001557 |
| 77 | AMT_REQ_CREDIT_BUREAU_HOUR | 0.000930 |
| 78 | AMT_REQ_CREDIT_BUREAU_WEEK | 0.000788 |
| 79 | FLAG_MOBIL | 0.000534 |
| 80 | FLAG_CONT_MOBILE | 0.000370 |
81 rows × 2 columns
The columns NAME_FAMILY_STATUS, CODE_GENDER, and NAME_INCOME_TYPE contain the values 'Unknown', 'XNA', and 'Maternity leave' respectively in the training set but not in the test set, so the rows carrying these values (11 in total) are removed from the training dataset.
df = df[df['NAME_FAMILY_STATUS'] != 'Unknown']
df = df[df['CODE_GENDER'] != 'XNA']
df = df[df['NAME_INCOME_TYPE'] != 'Maternity leave']
Considering columns whose absolute correlation with the TARGET variable exceeds 0.02
temp_df = corr_target(df_num,0.02)
temp_df
| | col_name | Correlation |
|---|---|---|
| 0 | TARGET | 1.000000 |
| 1 | EXT_SOURCE_3 | -0.178919 |
| 2 | EXT_SOURCE_2 | -0.160472 |
| 3 | EXT_SOURCE_1 | -0.155317 |
| 4 | DAYS_BIRTH | 0.078239 |
| 5 | REGION_RATING_CLIENT_W_CITY | 0.060893 |
| 6 | REGION_RATING_CLIENT | 0.058899 |
| 7 | DAYS_LAST_PHONE_CHANGE | 0.055218 |
| 8 | DAYS_ID_PUBLISH | 0.051457 |
| 9 | REG_CITY_NOT_WORK_CITY | 0.050994 |
| 10 | FLAG_EMP_PHONE | 0.045982 |
| 11 | DAYS_EMPLOYED | -0.044932 |
| 12 | FLAG_DOCUMENT_3 | 0.044346 |
| 13 | FLOORSMAX_AVG | -0.044003 |
| 14 | FLOORSMAX_MEDI | -0.043768 |
| 15 | FLOORSMAX_MODE | -0.043226 |
| 16 | DAYS_REGISTRATION | 0.041975 |
| 17 | AMT_GOODS_PRICE | -0.039645 |
| 18 | OWN_CAR_AGE | 0.037612 |
| 19 | REGION_POPULATION_RELATIVE | -0.037227 |
| 20 | ELEVATORS_AVG | -0.034199 |
| 21 | ELEVATORS_MEDI | -0.033863 |
| 22 | FLOORSMIN_AVG | -0.033614 |
| 23 | FLOORSMIN_MEDI | -0.033394 |
| 24 | LIVINGAREA_AVG | -0.032997 |
| 25 | LIVINGAREA_MEDI | -0.032739 |
| 26 | FLOORSMIN_MODE | -0.032698 |
| 27 | TOTALAREA_MODE | -0.032596 |
| 28 | LIVE_CITY_NOT_WORK_CITY | 0.032518 |
| 29 | DEF_30_CNT_SOCIAL_CIRCLE | 0.032248 |
| 30 | ELEVATORS_MODE | -0.032131 |
| 31 | LIVINGAREA_MODE | -0.030685 |
| 32 | AMT_CREDIT | -0.030369 |
| 33 | APARTMENTS_AVG | -0.029498 |
| 34 | APARTMENTS_MEDI | -0.029184 |
| 35 | FLAG_WORK_PHONE | 0.028524 |
| 36 | APARTMENTS_MODE | -0.027284 |
| 37 | LIVINGAPARTMENTS_AVG | -0.025031 |
| 38 | LIVINGAPARTMENTS_MEDI | -0.024621 |
| 39 | HOUR_APPR_PROCESS_START | -0.024166 |
| 40 | FLAG_PHONE | -0.023806 |
| 41 | LIVINGAPARTMENTS_MODE | -0.023393 |
| 42 | BASEMENTAREA_AVG | -0.022746 |
| 43 | YEARS_BUILD_MEDI | -0.022326 |
| 44 | YEARS_BUILD_AVG | -0.022149 |
| 45 | BASEMENTAREA_MEDI | -0.022081 |
| 46 | YEARS_BUILD_MODE | -0.022068 |
Out of these columns, we first need to deal with missing values, so we list all the columns that have them
temp_df_missing = missing(df[temp_df['col_name']],0)
Here, in cols_with_no_missing, we keep the columns that have no missing values
cols_with_no_missing = list(set(temp_df.col_name).difference(set(temp_df_missing.col_name)))
We set a threshold of 10 unique values per column: columns with 10 or fewer unique values are treated as discrete rather than continuous, and their missing values are imputed with the column mode
less_than_10 = temp_df_missing[temp_df_missing['no_of_unique_values']<=10]
temp_df_missing = temp_df_missing[temp_df_missing['no_of_unique_values']>10]
for i in less_than_10.col_name:
    df_num[i] = df_num[i].fillna(df_num[i].mode()[0])
for i in less_than_10.col_name:
    df_test[i] = df_test[i].fillna(df_test[i].mode()[0])
Finally, we collect the names of the numerical columns that we will carry forward into imputation and modeling
final_cols_in_num = list(less_than_10.col_name) + list(temp_df_missing.col_name) + cols_with_no_missing
This gives us the final numerical dataframe
df_num = df_num[final_cols_in_num]
Here, we get the missing-value percentages for the categorical columns; out of 16 columns, 6 have missing values
df_temp = missing(df_cat,0)
df_temp
| | col_name | percentage_missing | median/Mode | mean | no_of_unique_values |
|---|---|---|---|---|---|
| 0 | FONDKAPREMONT_MODE | 68.386172 | 0 reg oper account dtype: object | NA | 4 |
| 1 | WALLSMATERIAL_MODE | 50.840783 | 0 Panel dtype: object | NA | 7 |
| 2 | HOUSETYPE_MODE | 50.176091 | 0 block of flats dtype: object | NA | 3 |
| 3 | EMERGENCYSTATE_MODE | 47.398304 | 0 No dtype: object | NA | 2 |
| 4 | OCCUPATION_TYPE | 31.345545 | 0 Laborers dtype: object | NA | 18 |
| 5 | NAME_TYPE_SUITE | 0.420148 | 0 Unaccompanied dtype: object | NA | 7 |
Getting the counts of each category in these columns
Here we observe that for the following columns:
1) FONDKAPREMONT_MODE has 68% missing data, which makes imputing missing values with the mode inadvisable
2) WALLSMATERIAL_MODE: the gap between the first and second most frequently occurring values is small, so imputation with the mode is ambiguous
3) OCCUPATION_TYPE has 31% missing values, and again no single value can reasonably be chosen to impute the missing data
Thus we remove these columns from the categorical part of the dataframe
cat_cols_to_rem = ['FONDKAPREMONT_MODE','WALLSMATERIAL_MODE','OCCUPATION_TYPE']
df_cat.drop(columns = cat_cols_to_rem,inplace=True)
Visualizing the categorical columns to understand the data more efficiently
def col(cat):
    plt.figure(figsize=(10, 10))
    plt.title("Loan Default with respect to " + cat, fontweight='bold', fontsize=16)
    sns.countplot(x=df[cat], hue='TARGET', data=df, palette='Blues')
    plt.xticks(rotation=90)
What is the distribution of loans by gender?
print(df_cat['CODE_GENDER'].value_counts())
sns.countplot(df_cat['CODE_GENDER'], palette = 'Oranges')
plt.title("Percentage of loan with reference to gender", fontweight = 'bold', fontsize = 16)
F 202448 M 105059 XNA 4 Name: CODE_GENDER, dtype: int64
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Text(0.5, 1.0, 'Percentage of loan with reference to gender')
Inference: The number of female borrowers who have not repaid is comparatively higher than that of male borrowers.
What is the marital status of the clients?
print(df_cat['NAME_FAMILY_STATUS'].value_counts())
sns.countplot(df_cat['NAME_FAMILY_STATUS'], palette = 'Purples')
plt.title("Family Status vs Count", fontweight = 'bold', fontsize = 11)
Married 196432 Single / not married 45444 Civil marriage 29775 Separated 19770 Widow 16088 Unknown 2 Name: NAME_FAMILY_STATUS, dtype: int64
Text(0.5, 1.0, 'Family Status vs Count')
Inference: Married clients form the majority of borrowers and account for the most defaults in absolute terms, while the 'Unknown' status is negligible.
What percentage of clients own a car?
print(df_cat['FLAG_OWN_CAR'].value_counts())
sns.countplot(df_cat['FLAG_OWN_CAR'], palette = 'Oranges')
plt.title("Percentage of car owners in the dataset", fontweight = 'bold', fontsize = 11)
N 202924 Y 104587 Name: FLAG_OWN_CAR, dtype: int64
Text(0.5, 1.0, 'Percentage of car owners in the dataset')
Inference: About a third of clients own a car, while roughly two-thirds do not, and most of those who have not repaid the loan fall in the non-owner group.
What type of educational background do the clients have?
print(df_cat['NAME_EDUCATION_TYPE'].value_counts())
sns.countplot(df_cat['NAME_EDUCATION_TYPE'])
plt.title("Education type vs count")
plt.xticks(rotation=90)
Secondary / secondary special 218391 Higher education 74863 Incomplete higher 10277 Lower secondary 3816 Academic degree 164 Name: NAME_EDUCATION_TYPE, dtype: int64
(array([0, 1, 2, 3, 4]), <a list of 5 Text major ticklabel objects>)
col('NAME_EDUCATION_TYPE')
Inference: Clients with an academic degree are more likely to repay the loan than others.
What types of housing do the clients stay in?
plt.figure(figsize=[20,15])
plt.pie(df_cat['NAME_HOUSING_TYPE'].value_counts(),labels = df_cat['NAME_HOUSING_TYPE'].value_counts().index,autopct='%1.1f%%')
my_circle=plt.Circle( (0,0), 0.5, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
# plt.title('Percentage of User Types')
plt.show()
#plt.ticks(rotation=90)
col("NAME_HOUSING_TYPE")
Inference: From the chart above, the majority of clients stay in a house/apartment, and most defaults also occur in this group, while office apartments and co-op apartments account for a negligible share.
What are the income types of the loan applicants?
col("NAME_INCOME_TYPE")
Inference: Students and businessmen are negligible in number; the majority of defaulters are working people.
print(df_cat['WEEKDAY_APPR_PROCESS_START'].value_counts())
sns.countplot(df_cat['WEEKDAY_APPR_PROCESS_START'])
plt.title("Number of loan approval process vs day ", fontweight = 'bold', fontsize = 12)
TUESDAY 53901 WEDNESDAY 51934 MONDAY 50714 THURSDAY 50591 FRIDAY 50338 SATURDAY 33852 SUNDAY 16181 Name: WEEKDAY_APPR_PROCESS_START, dtype: int64
Text(0.5, 1.0, 'Number of loan approval process vs day ')
Inference: Loan applications peak on Tuesday, while the counts are clearly lowest on the weekend.
What types of loans are available?
print(df_cat['NAME_CONTRACT_TYPE'].value_counts())
sns.countplot(df_cat['NAME_CONTRACT_TYPE'], palette = 'Blues')
plt.title("Types of loan available", fontweight = 'bold', fontsize = 12)
Cash loans 278232 Revolving loans 29279 Name: NAME_CONTRACT_TYPE, dtype: int64
Text(0.5, 1.0, 'Types of loan available')
Inference: Far more clients take cash loans than revolving loans.
Here, we plot the columns most strongly correlated with the TARGET variable and observe their trends.
#Numerical Plot
plt.figure(figsize=[14,8])
sns.lineplot(x='TARGET',y='EXT_SOURCE_3',data=df,color= 'orange')
plt.title("EXT_SOURCE_3 vs TARGET", fontweight = 'bold', fontsize = 16)
Text(0.5, 1.0, 'EXT_SOURCE_3 vs TARGET')
plt.figure(figsize=[14,8])
sns.lineplot(x='TARGET',y='EXT_SOURCE_2',data=df,marker='*',color= 'blue')
plt.title("EXT_SOURCE_2 vs TARGET", fontweight = 'bold', fontsize = 16)
Text(0.5, 1.0, 'EXT_SOURCE_2 vs TARGET')
plt.figure(figsize=[14,8])
sns.lineplot(x='TARGET',y='EXT_SOURCE_1',data=df,marker='*',color= 'red')
plt.title("EXT_SOURCE_1 vs TARGET", fontweight = 'bold', fontsize = 16)
Text(0.5, 1.0, 'EXT_SOURCE_1 vs TARGET')
Inference: For the columns EXT_SOURCE_3, EXT_SOURCE_2, and EXT_SOURCE_1 we observe a clear negative correlation with the target, the strongest in the dataset.
What type of correlation do the columns DAYS_BIRTH and DAYS_LAST_PHONE_CHANGE have with respect to the target?
plt.figure(figsize=[14,8])
sns.lineplot(x='TARGET',y='DAYS_BIRTH',data=df,marker='*',color= 'orange')
plt.title("DAYS_BIRTH vs TARGET", fontweight = 'bold', fontsize = 16)
Text(0.5, 1.0, 'DAYS_BIRTH vs TARGET')
plt.figure(figsize=[14,8])
sns.lineplot(x='TARGET',y='DAYS_LAST_PHONE_CHANGE',data=df,marker='*',color= 'black')
plt.title("DAYS_LAST_PHONE_CHANGE vs TARGET", fontweight = 'bold', fontsize = 16)
Text(0.5, 1.0, 'DAYS_LAST_PHONE_CHANGE vs TARGET')
Inference: From the plots above, the columns DAYS_BIRTH and DAYS_LAST_PHONE_CHANGE show a positive correlation with respect to the target.
In this project, we create three pipelines: one for numerical data, one for categorical data, and a final pipeline to combine the two.
(i) Numerical data pipeline ('num_pipeline'): missing values are imputed with the column mean.
(ii) Categorical data pipeline ('cat_pipeline'): missing values are imputed with the mode, i.e. the most frequent value.
(iii) Final pipeline: merges the numerical and categorical columns so that no missing values remain; the categorical columns are also one-hot encoded.
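The three pipelines described above can be sketched end to end on a toy frame (the column names `income` and `suite` are illustrative, not the HCDR schema):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one gap per column type.
toy = pd.DataFrame({
    'income': [100.0, np.nan, 300.0],  # numeric gap -> mean-imputed to 200
    'suite': ['A', 'B', np.nan],       # categorical gap -> mode-imputed, then one-hot
})

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# sparse_threshold=0 forces a dense ndarray regardless of sklearn version.
data_pipeline = ColumnTransformer([
    ('num', num_pipeline, ['income']),
    ('cat', cat_pipeline, ['suite']),
], sparse_threshold=0)

out = data_pipeline.fit_transform(toy)
print(out.shape)  # (3, 3): the income column plus one column per suite category
```

The same structure, with the real column lists, is what `data_pipeline` builds below.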
Importing necessary packages
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Finally, selecting only the columns that we decided to keep for the numerical and categorical parts
df_num.drop(columns = ['TARGET'],inplace=True)
df_cols = list(df_num.columns)+ list(df_cat.columns)
df_i = df[df_cols]
We make two pipelines: one for the numerical data, where missing values are imputed with the column mean, and one for the categorical data, where missing values are imputed with the mode (the most frequent value).
num_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean'))])
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
Here we create a pipeline that merges the numerical and categorical columns; the categorical columns are one-hot encoded.
data_pipeline = ColumnTransformer([
("num_pipeline", num_pipeline, df_num.columns),
("cat_pipeline", cat_pipeline, df_cat.columns)], n_jobs = -1)
df_transformed = data_pipeline.fit_transform(df_i)
column_names = list(df_num.columns) + \
list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(df_cat.columns))
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. warnings.warn(msg, category=FutureWarning)
This is the final dataset that we get for training our model
df_n = pd.DataFrame(df_transformed, columns=column_names)
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn import metrics
models_results = []
X = df_n.values
y = df['TARGET'].values
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.15, random_state=42)
Here we use the lbfgs solver, Limited-memory BFGS (L-BFGS), a quasi-Newton optimization algorithm that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory.
pipe = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))
])
pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000))])
print('Training set accuracy score: ' + str(pipe.score(X_train,y_train)))
y_pred = pipe.predict(X_valid)
print('Validation set accuracy score: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe.predict_proba(X_valid)[:, 1]))
Training set accuracy score: 0.9192998565279771 Validation set accuracy score: 0.9185474254742547 Log loss: 2.8132732637903612 Confusion Matrix: [[42322 58] [ 3699 46]] ROC_AUC: 0.7357038265902436
models_results.append(['Logistic Regression',pipe.score(X_train,y_train),accuracy_score(y_valid,y_pred)])
metrics.plot_roc_curve(pipe, X_valid, y_valid)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function plot_roc_curve is deprecated; Function `plot_roc_curve` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: RocCurveDisplay.from_predictions or RocCurveDisplay.from_estimator. warnings.warn(msg, category=FutureWarning)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f92a4923a90>
pipe_naive_bayes = Pipeline([
('scaler', StandardScaler()),
('classifier', GaussianNB())
])
pipe_naive_bayes.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('classifier', GaussianNB())])
print('Training set accuracy score: ' + str(pipe_naive_bayes.score(X_train,y_train)))
y_pred = pipe_naive_bayes.predict(X_valid)
print('Validation set accuracy score: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe_naive_bayes.predict_proba(X_valid)[:, 1]))
Training set accuracy score: 0.174779531324725 Validation set accuracy score: 0.17394037940379403 Log loss: 28.531744445893832 Confusion Matrix: [[ 4543 37837] [ 265 3480]] ROC_AUC: 0.6249500419310065
models_results.append(['Naive Bayes',pipe_naive_bayes.score(X_train,y_train),accuracy_score(y_valid,y_pred)])
metrics.plot_roc_curve(pipe_naive_bayes, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f9297e8a090>
pipe_rf = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier())
])
pipe_rf.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier', RandomForestClassifier())])
print('Training set accuracy score: ' + str(pipe_rf.score(X_train,y_train)))
y_pred = pipe_rf.predict(X_valid)
print('Validation set accuracy score: ' + str(accuracy_score(y_valid,y_pred)))
print('Log loss: ',log_loss(y_valid,y_pred))
print('Confusion Matrix: ','\n',confusion_matrix(y_valid, y_pred))
print('ROC_AUC: ',roc_auc_score(y_valid, pipe_rf.predict_proba(X_valid)[:, 1]))
Training set accuracy score: 0.9999808703969393
Validation set accuracy score: 0.9189593495934959
Log loss: 2.799044972630154
Confusion Matrix:
[[42376     4]
 [ 3734    11]]
ROC_AUC: 0.7072585942811274
models_results.append(['Random Forest Classifier',pipe_rf.score(X_train,y_train),accuracy_score(y_valid,y_pred)])
metrics.plot_roc_curve(pipe_rf, X_valid, y_valid)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f9297dabfd0>
pd.DataFrame(models_results,columns = ['Model Name', 'Training Accuracy', 'Validation Accuracy'])
| | Model Name | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| 0 | Logistic Regression | 0.919300 | 0.918547 |
| 1 | Naive Bayes | 0.174780 | 0.173940 |
| 2 | Random Forest Classifier | 0.999981 | 0.918959 |
Here we see that the Random Forest Classifier has the best training accuracy, at nearly 100%, but accuracy this high runs the risk of overfitting. The Logistic Regression model also gives an accuracy of about 91%, which is decent enough to make it a considerable candidate. The ROC Area Under Curve values for the Random Forest Classifier and the Logistic Regression models are 0.707 and 0.735 respectively, and both curves capture a meaningful share of true positives, suggesting a reasonable fit. We cannot consider Naive Bayes as our model, as it is evidently underfitting the data. We need to check how the Random Forest Classifier and the Logistic Regression model perform on the test data to confirm whether Random Forest is really overfitting.
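The accuracy figures above are easier to interpret against a majority-class baseline: on data as imbalanced as HCDR, a classifier that always predicts "no default" already scores high accuracy, which is why ROC AUC is the fairer comparison metric. A minimal sketch on synthetic data (the ~8% positive rate here is illustrative):

```python
# Sketch: on imbalanced data a majority-class baseline already scores high
# accuracy, which is why ROC AUC is a better comparison metric here.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = (rng.random(1000) < 0.08).astype(int)  # ~8% positives (illustrative)

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
acc = accuracy_score(y_toy, baseline.predict(X_toy))
auc = roc_auc_score(y_toy, baseline.predict_proba(X_toy)[:, 1])
print(f'accuracy={acc:.3f}, roc_auc={auc:.3f}')  # high accuracy, AUC = 0.5
```

The dummy model's accuracy is simply the majority-class frequency, while its AUC collapses to 0.5, exposing that it has learned nothing.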
X = df_n.values
y = df['TARGET'].values
In the following pipeline we use StandardScaler to normalize the data to zero mean and unit standard deviation, and use Logistic Regression as our modeling algorithm, with the lbfgs solver and a maximum of 1000 iterations for convergence.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))
])
pipe.fit(X, y)
print('Training set score: ' + str(pipe.score(X,y)))
Training set score: 0.919190243902439
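As a quick sanity check of the scaling step described above, StandardScaler can be verified to produce zero-mean, unit-variance columns on a toy matrix (the values below are illustrative):

```python
# Sketch: StandardScaler standardizes each column to mean 0 and std 1,
# which helps the lbfgs solver in LogisticRegression converge.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```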
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
pipe_rf.fit(X, y)
Pipeline(steps=[('scaler', StandardScaler()),
('classifier', RandomForestClassifier())])
print('Training set score: ' + str(pipe_rf.score(X,y)))
Training set score: 0.9999464371114299
Here we select the same final set of columns for the test dataset that we used for the training dataset.
df_test_n= df_test[df_cols]
df_test_n
| DEF_30_CNT_SOCIAL_CIRCLE | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | LIVINGAPARTMENTS_MODE | FLOORSMIN_MEDI | FLOORSMIN_AVG | FLOORSMIN_MODE | YEARS_BUILD_MODE | YEARS_BUILD_MEDI | YEARS_BUILD_AVG | OWN_CAR_AGE | BASEMENTAREA_MEDI | BASEMENTAREA_AVG | EXT_SOURCE_1 | ELEVATORS_MODE | ELEVATORS_AVG | ELEVATORS_MEDI | APARTMENTS_MODE | APARTMENTS_MEDI | APARTMENTS_AVG | LIVINGAREA_AVG | LIVINGAREA_MODE | LIVINGAREA_MEDI | FLOORSMAX_AVG | FLOORSMAX_MODE | FLOORSMAX_MEDI | TOTALAREA_MODE | EXT_SOURCE_3 | EXT_SOURCE_2 | AMT_GOODS_PRICE | DAYS_LAST_PHONE_CHANGE | DAYS_BIRTH | REGION_RATING_CLIENT_W_CITY | REGION_POPULATION_RELATIVE | AMT_CREDIT | FLAG_EMP_PHONE | REG_CITY_NOT_WORK_CITY | FLAG_WORK_PHONE | LIVE_CITY_NOT_WORK_CITY | DAYS_EMPLOYED | FLAG_DOCUMENT_3 | DAYS_ID_PUBLISH | HOUR_APPR_PROCESS_START | DAYS_REGISTRATION | FLAG_PHONE | REGION_RATING_CLIENT | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | HOUSETYPE_MODE | EMERGENCYSTATE_MODE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0590 | 0.0590 | 0.752614 | NaN | NaN | NaN | 0.0672 | 0.0666 | 0.0660 | 0.0505 | 0.0526 | 0.0514 | 0.1250 | 0.1250 | 0.1250 | 0.0392 | 0.159520 | 0.789654 | 450000.0 | -1740.0 | -19241 | 2 | 0.018850 | 568800.0 | 1 | 0 | 0 | 0 | -2329 | 1 | -812 | 18 | -5170.0 | 0 | 2 | Cash loans | F | N | Y | Unaccompanied | Working | Higher education | Married | House / apartment | TUESDAY | Kindergarten | block of flats | No |
| 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.564990 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.432962 | 0.291656 | 180000.0 | 0.0 | -18064 | 2 | 0.035792 | 222768.0 | 1 | 0 | 0 | 0 | -4469 | 1 | -1623 | 9 | -9118.0 | 0 | 2 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | FRIDAY | Self-employed | NaN | NaN |
| 2 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.610991 | 0.699787 | 630000.0 | -856.0 | -20038 | 2 | 0.019101 | 663264.0 | 1 | 0 | 0 | 0 | -4458 | 0 | -3503 | 14 | -2175.0 | 0 | 2 | Cash loans | M | Y | Y | NaN | Working | Higher education | Married | House / apartment | MONDAY | Transport: type 3 | NaN | NaN |
| 3 | 0.0 | 0.2446 | 0.2404 | 0.2626 | 0.0417 | 0.0417 | 0.0417 | 0.9608 | 0.9597 | 0.9592 | NaN | 0.1974 | 0.1974 | 0.525734 | 0.3222 | 0.32 | 0.32 | 0.3109 | 0.3081 | 0.3052 | 0.3673 | 0.3827 | 0.3739 | 0.3750 | 0.3750 | 0.3750 | 0.3700 | 0.612704 | 0.509677 | 1575000.0 | -1805.0 | -13976 | 2 | 0.026392 | 1575000.0 | 1 | 0 | 0 | 0 | -1866 | 1 | -4208 | 11 | -2000.0 | 1 | 2 | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | WEDNESDAY | Business Entity Type 3 | block of flats | No |
| 4 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 16.0 | NaN | NaN | 0.202145 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.425687 | 625500.0 | -821.0 | -13040 | 2 | 0.010032 | 625500.0 | 1 | 1 | 1 | 1 | -2191 | 1 | -4262 | 5 | -4000.0 | 0 | 2 | Cash loans | M | Y | N | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | FRIDAY | Business Entity Type 3 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48739 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.643026 | 0.648575 | 270000.0 | -684.0 | -19970 | 3 | 0.002042 | 412560.0 | 1 | 0 | 1 | 0 | -5169 | 0 | -3399 | 16 | -9094.0 | 1 | 3 | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Widow | House / apartment | WEDNESDAY | Other | NaN | NaN |
| 48740 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.684596 | 495000.0 | 0.0 | -11186 | 2 | 0.035792 | 622413.0 | 1 | 1 | 0 | 1 | -1149 | 1 | -3003 | 11 | -3015.0 | 0 | 2 | Cash loans | F | N | N | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | MONDAY | Trade: type 7 | NaN | NaN |
| 48741 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 0.1364 | 0.1364 | 0.733503 | 0.1611 | 0.16 | 0.16 | 0.1134 | 0.1124 | 0.1113 | 0.1383 | 0.1441 | 0.1408 | 0.3333 | 0.3333 | 0.3333 | 0.1663 | 0.283712 | 0.632770 | 315000.0 | -838.0 | -15922 | 2 | 0.026392 | 315000.0 | 1 | 0 | 0 | 0 | -3037 | 1 | -1504 | 12 | -2681.0 | 1 | 2 | Cash loans | F | Y | Y | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | WEDNESDAY | Business Entity Type 3 | block of flats | No |
| 48742 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0723 | 0.0723 | 0.373090 | 0.1611 | 0.16 | 0.16 | 0.1660 | 0.1645 | 0.1629 | 0.1563 | 0.1204 | 0.1591 | 0.6250 | 0.6250 | 0.6250 | 0.1974 | 0.595456 | 0.445701 | 450000.0 | -2308.0 | -13968 | 2 | 0.018850 | 450000.0 | 1 | 1 | 1 | 1 | -2731 | 1 | -1364 | 10 | -1461.0 | 1 | 2 | Cash loans | M | N | N | Family | Commercial associate | Higher education | Married | House / apartment | MONDAY | Self-employed | block of flats | No |
| 48743 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 22.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.272134 | 0.456541 | 270000.0 | -327.0 | -13962 | 2 | 0.006629 | 312768.0 | 1 | 0 | 1 | 0 | -633 | 1 | -4220 | 14 | -1072.0 | 0 | 2 | Cash loans | F | Y | N | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | TUESDAY | Government | NaN | NaN |
48744 rows × 59 columns
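The column selection above assumes every training column also exists in the test frame. A slightly more defensive sketch (with illustrative column names, not the project's actual `df_cols`) uses `reindex`, so an absent column becomes NaN for the imputer to fill rather than raising a KeyError:

```python
# Sketch (illustrative names): selecting training-time columns on the test
# set with reindex, so a column absent from the test data becomes NaN for
# the imputer to fill instead of raising a KeyError.
import pandas as pd

train_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'AMT_CREDIT']
test_df = pd.DataFrame({'EXT_SOURCE_2': [0.5], 'AMT_CREDIT': [450000.0]})

aligned = test_df.reindex(columns=train_cols)
print(list(aligned.columns))                  # same order as train_cols
print(aligned['EXT_SOURCE_1'].isna().all())   # True: missing column -> NaN
```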
# df_num.drop(columns = ['TARGET'],inplace=True)
df_transformed = data_pipeline.fit_transform(df_test)
column_names = list(df_num.columns) + \
list(data_pipeline.transformers_[1][1].named_steps["ohe"].get_feature_names(df_cat.columns))
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. warnings.warn(msg, category=FutureWarning)
df_transformed.shape
(48744, 154)
df_n_test = pd.DataFrame(df_transformed, columns=column_names)
X = df_n_test.values
result = pipe_rf.predict(X)
result_prob = pipe_rf.predict_proba(X)
r = pd.DataFrame(result,columns=['result'])
r[['class_0_prob','class_1_prob']] = result_prob
final_sub = pd.DataFrame()
final_sub['SK_ID_CURR'] = df_test['SK_ID_CURR']
final_sub['TARGET'] = r['class_1_prob']
final_sub = final_sub.set_index('SK_ID_CURR')
final_sub
| TARGET | |
|---|---|
| SK_ID_CURR | |
| 100001 | 0.08 |
| 100005 | 0.12 |
| 100013 | 0.06 |
| 100028 | 0.01 |
| 100038 | 0.07 |
| ... | ... |
| 456221 | 0.02 |
| 456222 | 0.08 |
| 456223 | 0.02 |
| 456224 | 0.06 |
| 456250 | 0.11 |
48744 rows × 1 columns
final_sub.to_csv('submission.csv')
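The slicing of `predict_proba` into `class_0_prob` and `class_1_prob` above relies on the column order matching the classifier's `classes_` attribute. A minimal sketch on toy data showing how to look up the positive-class column explicitly rather than assuming it is the second one:

```python
# Sketch: the columns of predict_proba follow the order of classifier.classes_,
# so the probability of default (class 1) is the column where classes_ == 1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

pos_col = int(np.where(clf.classes_ == 1)[0][0])
proba = clf.predict_proba(X_demo)[:, pos_col]
print(clf.classes_)   # sorted class labels, e.g. [0 1]
print(proba.shape)    # one probability per row
```

For binary labels 0/1 the classes are sorted, so column 1 is indeed the default probability, but the explicit lookup keeps the submission correct even if the label encoding changes.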
Kaggle submission scores: for Logistic Regression, and for the Random Forest Classifier.
With high training accuracy and a low test accuracy, we can see that the Random Forest Classifier seems to be overfitting, and thus the ideal choice of model is Logistic Regression.
In this phase, we actually start working with the dataset. This included understanding the dataset and the information we are presented with, and then cleaning the data accordingly. We picked only the features which were important to the target variable and the prediction. We engineered the features by performing one-hot encoding (OHE) and applied imputation methods to fix the data before feeding it to the model. We were able to create the baseline pipeline and experimentally compare the accuracies of models such as Logistic Regression, Naive Bayes, and Random Forest. Based on the results of the models, we saw that Naive Bayes may be underfitting and Random Forest may be overfitting. The best model that we could get for Phase 0 was Logistic Regression, which gave a training accuracy of 91.9% and a Kaggle submission score of 73.6%.

Going forward, we plan to improve the feature engineering and perform hyperparameter tuning for our models, using K-Fold cross-validation and GridSearchCV; we may also use some advanced gradient boosting models so that we can get as close to the best accuracy as possible. After the aforementioned steps, we also plan to apply deep learning techniques, such as Artificial Neural Networks, for better prediction results.
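The planned GridSearchCV tuning step could look like the following sketch, reusing the scaler-plus-LogisticRegression pipeline from this phase on synthetic data; the parameter grid and fold count here are illustrative, not the values the project will necessarily use:

```python
# Sketch of the planned tuning step: GridSearchCV over the logistic-regression
# pipeline with stratified K-fold CV and ROC AUC scoring (grid values are
# illustrative, not the ones the project will necessarily use).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, weights=[0.9, 0.1],
                                     random_state=0)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', LogisticRegression(max_iter=1000))])
param_grid = {'classifier__C': [0.01, 0.1, 1.0]}  # illustrative grid

search = GridSearchCV(pipe, param_grid, scoring='roc_auc',
                      cv=StratifiedKFold(n_splits=5, shuffle=True,
                                         random_state=0))
search.fit(X_demo, y_demo)
print(search.best_params_)  # best C from the grid
print(round(search.best_score_, 3))  # mean cross-validated ROC AUC
```

Scoring by `roc_auc` rather than accuracy matters here for the same class-imbalance reason discussed in the model comparison.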